
feat(endpoints): Add OpenAI Responses API endpoint with fixes and integration tests#43

Open
acere wants to merge 32 commits into awslabs:main from acere:ResponseAPI

Conversation

@acere
Collaborator

@acere acere commented Mar 25, 2026

Summary

Adds OpenAI Responses API endpoint support to LLMeter, with fixes to align with the actual API behavior.

Changes

Endpoint fixes (llmeter/endpoints/openai_response.py)

  • Rename max_tokens to max_output_tokens in create_payload (the Responses API parameter name)
  • Fix _parse_response to handle usage=None (Bedrock Mantle doesn't always return it) and use input_tokens/output_tokens with fallback to prompt_tokens/completion_tokens
  • Rewrite _parse_stream_response to process typed events (response.output_text.delta, response.completed) instead of the old chunk-with-output-array format
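The usage-handling fix above can be sketched roughly as follows. This is a hypothetical illustration (the function name `extract_token_counts` is invented here), not the actual code in llmeter/endpoints/openai_response.py:

```python
# Hypothetical sketch of the usage-fallback logic described above; the real
# implementation in _parse_response may differ in structure and naming.
def extract_token_counts(usage):
    """Return (input_tokens, output_tokens), tolerating usage=None and
    falling back to the legacy prompt_tokens/completion_tokens names."""
    if usage is None:  # Bedrock Mantle doesn't always return a usage block
        return None, None
    input_tokens = getattr(usage, "input_tokens", None)
    if input_tokens is None:
        input_tokens = getattr(usage, "prompt_tokens", None)
    output_tokens = getattr(usage, "output_tokens", None)
    if output_tokens is None:
        output_tokens = getattr(usage, "completion_tokens", None)
    return input_tokens, output_tokens
```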

Integration tests

  • Add tests/integ/test_response_endpoint.py — integration tests for ResponseEndpoint and ResponseStreamEndpoint wrappers against Bedrock Mantle
  • Fix tests/integ/test_response_bedrock.py to use ResponseUsage attribute names (input_tokens/output_tokens)

Unit test updates

  • Update all unit test mocks across 5 test files to use spec-based usage mocks (input_tokens/output_tokens) and event-based streaming mocks

Example notebook

  • Add examples/LLMeter with OpenAI Response API on Bedrock.ipynb demonstrating non-streaming and streaming usage with Runner and plotting

Testing

  • All 527 unit tests pass
  • Ruff lint clean

acere added 2 commits March 24, 2026 21:41
… test suite

- Add ResponseEndpoint and ResponseStreamEndpoint classes for OpenAI Responses API support
- Implement non-streaming and streaming response handling with proper error management
- Add structured output support with response format validation and serialization
- Create comprehensive unit test suite covering response parsing, error handling, format validation, model parameters, payload parsing, properties, and serialization
- Add integration tests for Bedrock response endpoint functionality
- Export new response endpoint classes from endpoints module
- Update integration test configuration with response endpoint fixtures
- Rename max_tokens to max_output_tokens in create_payload (Response API
  parameter name)
- Fix _parse_response to handle usage=None (Bedrock Mantle) and use
  input_tokens/output_tokens with fallback to prompt_tokens/completion_tokens
- Rewrite _parse_stream_response to process typed events
  (response.output_text.delta, response.completed) instead of the old
  chunk-with-output-array format
- Fix test_response_bedrock.py to use ResponseUsage attribute names
  (input_tokens/output_tokens)
- Add integration tests for ResponseEndpoint and ResponseStreamEndpoint
- Add example notebook for Response API on Bedrock
- Update all unit test mocks to match new behavior
@acere acere requested a review from athewsey March 25, 2026 01:50
@acere acere self-assigned this Mar 25, 2026
Comment on lines +16 to +17
ResponseEndpoint,
ResponseStreamEndpoint,
Collaborator

Shouldn't these be OpenAIResponseEndpoint and OpenAIResponseStreamEndpoint for consistency with the existing ChatCompletion ones? "ResponseEndpoint" seems very generic.

Collaborator Author

I agree. Updated.

try:
client_response = self._client.responses.create(**payload)
except APIConnectionError as e:
logger.error(e)
Collaborator

In bedrock_invoke and litellm we're using logger.exception(e), which also prints the stack trace... I'd suggest we standardize on one or the other when handling endpoint invocation errors into InvocationResponse.error_outputs?

Collaborator Author

go logger.exception(e)!
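The distinction the review thread settles on can be demonstrated in a minimal, standalone sketch (the `invoke_and_log` function and logger name are invented for illustration; they are not LLMeter code):

```python
import logging

logger = logging.getLogger("llmeter.example")

def invoke_and_log():
    """Minimal sketch contrasting the two logging calls discussed above."""
    try:
        raise ConnectionError("endpoint unreachable")
    except ConnectionError as e:
        # logger.error(e) would record only the message; logger.exception(e)
        # logs at ERROR level and appends the full traceback as well.
        logger.exception(e)
```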

max_output_tokens: int = 256,
instructions: str | None = None,
**kwargs,
) -> Dict:
Collaborator

Unlike boto3, the OpenAI Python SDK has pretty solid (and TypedDict-based) typings already... Should we even be creating this convenience method in LLMeter? Or just typing payload as ResponseCreateParams for this endpoint and encouraging users to build it via the OpenAI SDK directly?

(Same logic would apply to the existing ChatCompletions endpoint too)

Collaborator Author

Updated all OpenAI endpoint classes to leverage the SDK typing; that simplified some of the parsing gymnastics. I'm not in favor of sunsetting create_payload: it's not a hard requirement to create payloads using this method, but it offers an easy, consistent way to create tests across providers.
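Since TypedDicts such as the SDK's ResponseCreateParams impose no runtime cost, a plain dict can be passed straight to client.responses.create(**payload). A rough sketch of what building such a payload by hand looks like (the `build_payload` helper is invented for illustration and is not LLMeter's create_payload):

```python
# Illustrative only: a plain dict matching the Responses API request shape.
# The SDK's ResponseCreateParams TypedDict exists only for static typing,
# so a dict like this works directly with client.responses.create(**payload).
def build_payload(model, prompt, max_output_tokens=256, instructions=None):
    payload = {
        "model": model,
        "input": prompt,
        "max_output_tokens": max_output_tokens,
    }
    if instructions is not None:
        payload["instructions"] = instructions
    return payload
```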

# Configure OpenAI client with Bedrock Mantle endpoint for Response API
# Response API uses bedrock-mantle endpoint, not bedrock-runtime
base_url = f"https://bedrock-mantle.{aws_region}.api.aws/v1"
client = OpenAI(api_key=token, base_url=base_url)
Collaborator

Looks like this is just testing the OpenAI SDK and not the LLMeter endpoint??

Same for the streaming test below too

Collaborator Author

Indeed... fixed now.

@athewsey
Collaborator

athewsey commented Apr 6, 2026

Also, almost forgot: we should add the relevant module placeholder .md under the docs API reference.

acere and others added 24 commits April 8, 2026 22:40
- Replace Poetry with uv in GitHub Actions PyPI workflow for faster builds
- Update .gitignore to track uv.lock instead of poetry.lock
- Migrate pyproject.toml from Poetry format to standard PEP 621 format
- Update CONTRIBUTING.md with uv installation and development instructions
- Update README.md with uv installation examples for both basic and extras
- Simplify dependency management and build configuration
- Improve CI/CD performance and developer experience with uv tooling
…poetry in the documentation.

update test documentation.
- Upgrade astral-sh/setup-uv action from v4 to v7
- Update Python version requirement from <3.13 to <4 in pyproject.toml
- Add reference to tests/README.md in CONTRIBUTING.md for testing documentation
- Align with uv package manager migration and improve version flexibility
Use importlib.metadata to read the version from installed package
metadata, with a fallback to "0.0.0" when the package is not formally
installed. This fixes `AttributeError: module 'llmeter' has no
attribute '__version__'`.
Use the __name__ variable to retrieve LLMeter's version from
importlib, rather than hard-coding the module's name.
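A sketch of the version-lookup approach these two commits describe (the `get_package_version` name is invented here; the real code reads the package name from `__name__` rather than taking a parameter):

```python
# Read the installed package's version via importlib.metadata, falling back
# to "0.0.0" when the package is not formally installed (e.g. running from
# a plain source checkout without `pip install -e .`).
from importlib.metadata import PackageNotFoundError, version

def get_package_version(package_name):
    try:
        return version(package_name)
    except PackageNotFoundError:
        return "0.0.0"
```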
Update test payloads and JMESPath expressions in test_bedrock_invoke.py
to match Amazon Nova's native Invoke API format, since the default
BEDROCK_TEST_MODEL was changed from Claude to Nova in PR awslabs#36.

- Non-streaming: use output.message.content[0].text, usage.outputTokens
- Streaming: use contentBlockDelta.delta.text, metadata.usage.*Tokens
- Request payload: use schemaVersion messages-v1 and inferenceConfig

Fixes awslabs#38
onnxruntime 1.24.3 dropped Python 3.10 support, causing the release
workflow to fail. Bump the build environment to Python 3.12.
uv build only needs the build backend (hatchling), which it resolves
on its own. Installing all dev/test dependencies is unnecessary and
was pulling in onnxruntime which lacks Python 3.10 wheels.
Still lots of gaps to fill in
…nd fix build warnings

- Add metrics and statistics page with LLM latency concepts (TTFT, TTLT, TPOT),
  percentile reliability guidance, run-level stats, cost metrics, and visualization examples
- Add API reference pages for callbacks (base, cost, mlflow) and bedrock_invoke endpoint
- Update installation page with uv instructions, mlflow extra, OpenAI-compatible API description
- Fix broken relative links in index.md and key_concepts.md
- Add type annotations to fix all griffe warnings in mkdocs build
- Fix docstring issues (parameter name mismatch, indentation) in base.py and runner.py
- Pin mkdocs<2 to avoid incompatible upstream changes
- Add callbacks card to homepage
Move overall homepage within the User Guide instead of a confusing
separate tab. Add an API Reference home page.
We don't have github discussions enabled anyway
Add headers to module pages so they don't appear as 'index'. Add some
clarifying text to API reference home page. Add some missing pages
and fix associated griffe type warnings. Improve some docstrings.
As discussed at https://fpgmaas.com/blog/collapse-of-mkdocs/, MkDocs
has been unmaintained for some time and the new v2 will not support
Material for MkDocs, which we used for theming. Migrate to Zensical,
a project by the team behind Material for MkDocs that aims to offer
easy compatibility. Also, update the docs GitHub workflow to reflect
our moves from Poetry to uv and from MkDocs to Zensical.
Include section in contributing file to guide devs on how to preview
and maintain the documentation website.
Remove custom analytics placeholder page. Fill out 'run experiments'
placeholder page. Move unnecessarily folder-nested user guide pages
up to the root (URL won't change if we folder them again in future
when we have more content).
Add push trigger for main branch with path filters on docs/** and
mkdocs.yml so documentation updates are deployed without waiting for
a release.
…dencies

The docs build only needs mkdocstrings and zensical. Using --only-group
instead of --group skips the main project dependencies (torch, mlflow,
nvidia packages, etc.) that are not needed for static doc generation.
The deploy-pages action requires id-token: write to obtain the
ACTIONS_ID_TOKEN_REQUEST_URL needed for authentication.
- Include .github/workflows/docs.yml in path filter so workflow
  changes also trigger a docs build
- Add id-token: write permission required by deploy-pages action
- Use --only-group docs to skip unnecessary main dependencies
- Add environment declaration required by deploy-pages action
- Use --frozen on uv sync and --no-sync on uv run to prevent
  re-installing the full project dependencies during build
acere and others added 6 commits April 8, 2026 22:40
- Configure mkdocstrings Python handler with Google-style docstring
  parsing, source links, cross-references, and merged __init__ docs
- Add missing prompt_utils API reference page and nav entry
- Fix table column width issues causing awkward word splits in code
  tokens by keeping inline code on one line and setting min-width on
  description columns
- Update CONTRIBUTING.md with lightweight docs build instructions
  using uv sync --only-group docs
Going back to sorting attributes alphabetically in the API doc for
easier searching.
uv version without --no-sync modifies pyproject.toml and triggers an
automatic sync, resolving and installing all 280+ dependencies
unnecessarily in the publish workflow.
- Rename ResponseEndpoint -> OpenAIResponseEndpoint and
  ResponseStreamEndpoint -> OpenAIResponseStreamEndpoint for
  consistency with OpenAICompletionEndpoint naming convention
- Change logger.error() to logger.exception() for stack trace
  consistency with bedrock_invoke.py and litellm.py
- Rewrite test_response_bedrock.py to test LLMeter endpoint wrappers
  instead of raw OpenAI SDK
- Update serialization test assertions for new class names
- Update example notebook references
- Add docs/reference/endpoints/openai_response.md placeholder
- Add openai_response to mkdocs.yml nav under endpoints
- Update connect_endpoints user guide to mention Response API endpoints
- Type invoke() payload as CompletionCreateParams / ResponseCreateParams
- Type create_payload() return as SDK TypedDicts using cast()
- Replace jmespath with plain list comprehension in _parse_payload
- Rewrite stream parsers using typed ChatCompletionChunk / event types,
  removing all hasattr/getattr fallbacks and type: ignore comments
- Make OpenAIResponseStreamEndpoint inherit from OpenAIResponseEndpoint,
  deduplicating _parse_payload and create_payload
- Use collections.abc.Sequence instead of typing.Sequence
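The event-based stream parsing this refactor describes can be sketched as follows. This is a hedged illustration using duck typing (the `collect_stream_output` name is invented); the real parser works with the SDK's typed event classes, but the event type strings follow the OpenAI Responses streaming protocol:

```python
# Accumulate text deltas and capture the final response object from a
# Responses API event stream (response.output_text.delta carries a text
# fragment in .delta; response.completed carries the full .response).
def collect_stream_output(events):
    chunks = []
    final_response = None
    for event in events:
        if event.type == "response.output_text.delta":
            chunks.append(event.delta)
        elif event.type == "response.completed":
            final_response = event.response
    return "".join(chunks), final_response
```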